Using Location Information from Speech Recognition of Television News Broadcasts
Authors
Abstract
The Informedia Digital Video Library system extracts information from digitized video sources and allows full content search and retrieval over all extracted data. This extracted 'metadata' enables users to rapidly find interesting news stories and to quickly identify whether a retrieved TV news story is indeed relevant to their query. Through the extraction of named entity information from broadcast news we can determine what people, organizations, dates, times and monetary amounts are mentioned in the broadcast. With respect to location data, we have been able to use location analysis derived from the speech transcripts to allow the user to visually follow the action in a news story on a map, and also to allow queries for news stories by graphically selecting a region on the map.

1. The Informedia Digital Video Library Project

The Informedia Digital Video Library project [1], initiated in 1994, uniquely utilizes integrated speech, image and natural language understanding to process broadcast video. The project's goal is to allow search and retrieval in the video medium, similar to what is available today for text only. To enable this access to video, fast, high-accuracy automatic transcriptions of broadcast news stories are generated through Carnegie Mellon's Sphinx speech recognition system, and closed captions are incorporated where available. Image processing determines scene boundaries, recognizes faces [4][12] and allows for image similarity comparisons. Text visible on the screen is recognized through video OCR [5] and can be searched. Everything is indexed into a searchable digital video library [2][3], where users can ask queries and retrieve relevant news stories as results.

The News-on-Demand collection in the Informedia Digital Library serves as a testbed for automatic library creation techniques, applied to continuously captured television and radio news content from multiple countries in a variety of languages. As of January 1998, the Informedia project had about 1.5 terabytes of news video indexed and accessible online, with over 1,600 news broadcasts containing about 40,000 news stories dating back to 1996.

The Informedia system allows information retrieval in both the spoken language and the video or image domains. Queries for relevant news stories may be made with words, images or maps. Faces are detected in the video and can be searched. Information summaries can be displayed at varying levels of detail, both visually and textually. Text summaries are displayed for each news story through topics and titles. Visual summaries are given through thumbnail images, filmstrips and dynamic video skims. Every location referenced in the news stories is labeled for geographic display on a map, and the corresponding news item can be retrieved through a map area selection. The system also provides for extraction and reuse of video documents encoded in MPEG-1 format for web-based access and presentation. A multi-lingual component, currently implemented for Spanish and Serb/Croatian corpora, translates English-language queries for text search into the target language. English-language topics are also assigned to news stories. A user can add spoken or typed annotations to any news story, and these are immediately searchable. News clips can be cut and pasted into HTML or PowerPoint presentations.

1.1 Speech Recognition

Speech recognition in the Informedia Digital Video Library is done in two different passes.
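In outline, the arrangement is as follows. The paper does not give an implementation, so this is only a minimal sketch: the recognizer handles (fast_recognizer, accurate_recognizer) and the index object are hypothetical stand-ins for illustration, not actual Sphinx-II/Sphinx-III or Informedia interfaces.

def transcribe_story(story_id, audio, fast_recognizer, accurate_recognizer, index):
    # Sketch of a two-pass transcription pipeline (hypothetical interfaces).
    # Pass 1: small vocabulary, restricted search -- quick but less accurate,
    # so the story becomes searchable almost immediately.
    rough = fast_recognizer(audio)
    index.add_transcript(story_id, rough, quality="rough")

    # Pass 2: larger models and a wider beam -- slower but more accurate;
    # the better transcript quietly replaces the rough one later.
    final = accurate_recognizer(audio)
    index.add_transcript(story_id, final, quality="final")

The point of the split is latency: a rough transcript is available right away, while the more accurate transcript arrives once the slower pass completes.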
To get an initial transcript into the library as quickly as possible, a 20,000-word vocabulary version of the Sphinx-II recognizer is applied [6]. In a second pass, we use the slower, but more accurate, Sphinx-III speech recognition system. In the 1997 DARPA broadcast news evaluations, the CMU Sphinx-III system achieved an overall word error rate of 24% when multiple passes were applied [7]. However, this result was obtained at processing speeds of several hundred times real time. To make the speech recognition reasonably fast, we restrict the beam of the recognizer, resulting in a word error rate of about 34% on the broadcast news evaluation data. To obtain better performance, we use additional training data extracted from closed-captioned news transcripts for which the speech recognizer has high confidence that the data is correct [9][10]. In addition, we build a new language model every day, which interpolates a standard broadcast news transcription corpus [7][8] with current online web news reports and actual transcripts available on the CNN website (cnn.com). With help from both the improved acoustic models and the daily language model, we obtain a word error rate below 20%. Since we can parallelize the processing of a news story, the actual transcript data appears in our library within 2.5 hours of the broadcast time.

2. Location Analysis

A named-entity extraction process is used to provide location data derived from the speech transcripts for geocoding, map displays and searches. Using a named entity tagger implemented as described in [13][14], we extract possible location phrases from the audio transcript. We also take the output of the video optical character recognition extraction [5] and extract possible location phrases from the video OCR. All these possible locations are then cross-referenced against a gazetteer of about 80,000 places and their coordinates [16], which includes countries, cities, villages and states or provinces from all over the world. Currently excluded from this list are water areas, mountains and other non-political geographical data.

Since we ignore all locations that cannot be found in the gazetteer with a latitude and longitude, the accuracy of the named entity extraction process per se is not as critical. What is more relevant is whether we have identified the correct coordinates for a location. If the location is not in the gazetteer, then we have no chance of providing the coordinates of this location entity. If the words from the phrase describing the location were not in the speech recognition vocabulary, we again have no chance of providing the proper coordinates, even if we have properly identified the words as denoting a location. Often, however, locations are ambiguous in their coordinates, and we then need to disambiguate the different references, e.g. "Washington" may refer to a number of cities in the United States, or it can refer to a state. A simple hierarchical disambiguation scheme has been implemented to distinguish among candidate coordinates (a code sketch of this cascade is given further below):

1. If a location is determined to be ambiguous, we first check whether other location references within the current news story disambiguate among the alternate locations.

2. If the location is still ambiguous, we then see whether other location references favor a particular state or province. This way we can, for example, distinguish between different, initially ambiguous, references to Memphis in different states within the United States.
3. If there is no disambiguating evidence at the state/province level, we check whether the location can be distinguished on the basis of a country reference elsewhere in the news story transcript. The mention of France in conjunction with Paris would, for example, distinguish Paris, Texas from Paris, France, under the assumption that either Texas or France is mentioned elsewhere in the news story.

4. If the location is still ambiguous, we check for references to locations within continents that might disambiguate it.

5. If the information is still ambiguous, we either discard it or choose the first of the alternative location interpretations.

It is obvious that a more sophisticated approach would use a large amount of manually coded training data to help distinguish ambiguous locations. With sufficient amounts of training data, an HMM-based approach along the lines of [14] would be feasible. Since we are not in a position to afford manual geocoding of large amounts of data, we rely on this simple approach, which generally does distinguish between different ambiguous locations. One notable exception is the disambiguation of "New York" as either a city or a state. Our approach frequently fails to distinguish between the two, for example in "The governor of New York drove to New York to meet with the mayor." Only if we have cues through the mention of "New York City" vs. "New York State" can we distinguish between these two alternative location references. A big factor in the geocoding of information is the quality of the gazetteer. At present we have no automatic process for adding new locations and coordinates to our gazetteer, and any errors, misspellings or erroneous data must be corrected by hand.

The location information is stored in a relational database, which associates the geocoded locations with specific news stories and with time periods within each news story. This database can be queried textually, by looking for the location names as text strings; geographically, by the coordinates of the locations; and also through a lookup of the ID of a news story. In each case, a set of location names, coordinates and news story identification tags is returned. Having the coordinates identified allows us to use the location database information in three ways:

1. Locations that occur more than once in a news story are added to the title information. Since automatically created titles [11] from speech transcripts are quite noisy, this helps identify major locations in the news story.

2. Locations can be dynamically displayed on a map, allowing the user to "follow along" with the geographic focus of a news story. As the video or audio for the relevant paragraph selection is playing, the locations in focus are highlighted. The dynamic map also serves as a static summary of the locations in the news story, by identifying all of them at one glance.

3. We can use a query that graphically specifies a rectangle on the map to find all news stories that refer to any of the locations within this map area. The user can drag a rectangle over the map to specify the area of interest.

Figure 1 shows an automatically generated map for a news story detailing President Clinton's trip to Africa in March of 1998. In the news story, the transcript talks about the various stops in Ghana, Senegal, Uganda, Rwanda, Botswana and South Africa. This screen snapshot was taken as the narrator detailed the events at Clinton's next stop, in Uganda.
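Returning to the hierarchical disambiguation scheme listed above, the following is a minimal sketch of how such a cascade might be coded. The candidate record fields (state, country, continent) and the story_places set are assumptions made for illustration, not the system's actual data structures, and step 1 (direct evidence from the other location references in the story) is folded into the same narrowing idea.

def disambiguate(candidates, story_places):
    # candidates: gazetteer records (dicts) for one ambiguous place name.
    # story_places: names of the other locations mentioned in the same story.
    def narrow(cands, field):
        # Keep only candidates whose containing region is itself mentioned
        # elsewhere in the story; if none are supported, keep the full set.
        supported = [c for c in cands if c.get(field) in story_places]
        return supported or cands

    remaining = list(candidates)
    # Steps 1-4: look for supporting evidence, from specific to general.
    for field in ("state", "country", "continent"):
        if len(remaining) <= 1:
            break
        remaining = narrow(remaining, field)

    # Step 5: still ambiguous -- either discard (return None) or simply
    # take the first of the alternative interpretations, as done here.
    return remaining[0] if remaining else None

# Example: disambiguate(
#     [{"name": "Paris", "state": "Texas", "country": "United States"},
#      {"name": "Paris", "country": "France"}],
#     story_places={"France"})
# keeps the French interpretation, since "France" occurs in the story.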
The system highlighted the country and any cities currently in focus, while showing the other countries and places mentioned in this news story less boldly. A user might then select an area on this map with the mouse and initiate a spatial query to retrieve any news stories related to the specified locations. Thus the location information allows both active maps, where the map changes according to the location currently mentioned in the audio, and interactive maps, where the user makes a request by selecting an area on the map.
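As a rough illustration of how such a map-area request might be answered from the relational location database described above, consider the sketch below; the table and column names (story_locations, story_id, lat, lon) are assumptions for illustration, not the Informedia schema.

import sqlite3

# Assumed schema: one row per geocoded location reference in a story.
#   story_locations(story_id, place_name, lat, lon, start_sec, end_sec)

def stories_in_map_area(db_path, lat_min, lat_max, lon_min, lon_max):
    # Return the ids of all news stories that mention a location whose
    # coordinates fall inside the rectangle the user dragged on the map.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT DISTINCT story_id FROM story_locations "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (lat_min, lat_max, lon_min, lon_max),
    ).fetchall()
    conn.close()
    return [story_id for (story_id,) in rows]

Given the returned story identifiers, the library front end can then present the matching news stories exactly as it would for a text query.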